Impact of Image Quality on Machine Print
Optical Character Recognition

Michael D. Garris, Stanley Janet, and William W. Klein
National Institute of Standards and Technology


ABSTRACT 

The National Institute of Standards and Technology (NIST) is in the process of setting up a new series of
conferences named the Metadata Text Retrieval Conferences (METTREC). They will focus on
evaluating two critical technologies: document conversion using optical character recognition (OCR) and
information retrieval (IR). Large collections of document images labeled with correct recognition and
retrieval responses are needed to measure performance. Currently, the production of these materials is
extremely expensive. NIST is developing a semi-automated truthing tool that will help reduce the cost of
data preparation and enable evaluations to scale up. To accomplish this, current OCR technology is
needed to produce an initial text-to-image alignment. This paper describes a small experiment in which
three different vendor products (two Windows NT/95-based and one UNIX-based) are evaluated across
three sets of document images containing progressively decreasing print and image quality. The
evaluation images contain subjectively selected pages from the 1994 Federal Register. Results
demonstrate the impact of degrading print and image quality, with reported character recognition error
rates ranging from 1% to as high as 74%.

Keywords: image quality, information retrieval, IR, machine print, METTREC, optical character
recognition, OCR, page decomposition, technology evaluation

1.0  INTRODUCTION 

The National Institute of Standards and Technology (NIST) is setting up a new series of conferences
named the Metadata Text Retrieval Conferences (METTREC). They will focus on evaluating two critical
technologies: document conversion using optical character recognition (OCR) and information retrieval
(IR). Evaluations will be designed to investigate the impact of machine recognition errors on information
retrieval and to determine what interfaces are appropriate to integrate the two technologies.

To support these evaluations, large training and testing sets of documents must be created. The Federal
Register (FR) for 1994 has been chosen to be the initial source of documents for METTREC because it is:
(1) a complete set of documents within the public domain; (2) a large collection containing over 250
issues consisting of over 67,000 pages of information; (3) a structured document set whose hierarchy
contains metadata; (4) a collection of pages containing significant variations in print and image quality;
and (5) a set of documents for which the text for the entire collection is stored in electronic files.

To conduct METTREC evaluations, each FR image page must be matched with its corresponding text to
generate the "ground truth." The ground truth represents the correct text that an OCR system should
recognize and that an IR system should retrieve. Text for each day’s issue of the FR has been provided by
the Government Printing Office (GPO) and is stored in electronic files, but unfortunately the
correspondence of the text to exact pages within an issue is not recorded. Ideally, we would like to know
the image position of every word on a page.

We are currently working on a semi-automated process by which the ground truth can be derived in an
effective and efficient way. We have the images, and we have the correct text over a range of images.
Our approach will use OCR to generate a "noisy" text-to-image correspondence. Dynamic string
alignment will then be used to match the correct GPO text to the noisy OCR text. For each page image,
ground truth will be automatically produced where the OCR is sufficiently accurate. The correspondence
of remaining passages of unaligned text will need to be assigned manually.
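
The alignment step described above can be illustrated with a minimal sketch. It is not the NIST truthing tool: it assumes word-level matching and uses Python's standard `difflib.SequenceMatcher` as a stand-in for the dynamic string alignment, and the function name `align_words` and the `min_run` threshold are hypothetical. In a real system each OCR word would also carry its image position, so an aligned truth word would inherit that position.

```python
from difflib import SequenceMatcher


def align_words(truth_words, ocr_words, min_run=3):
    """Split the truth text into auto-aligned words and leftover words.

    Returns (aligned, unaligned). `aligned` holds (truth_index, ocr_index)
    pairs for words inside exactly matching runs of at least `min_run` words.
    `unaligned` holds the truth word indices left for manual assignment.
    """
    # autojunk=False disables difflib's popularity heuristic, which would
    # otherwise ignore very frequent words in long sequences.
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    aligned = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:
            for k in range(block.size):
                aligned.append((block.a + k, block.b + k))
    matched = {t for t, _ in aligned}
    unaligned = [i for i in range(len(truth_words)) if i not in matched]
    return aligned, unaligned


# Hypothetical example: the OCR output misreads "all" as "ail".
truth = "the rate of assessment apply to all assessable papayas handled".split()
ocr = "the rate of assessment apply to ail assessable papayas handled".split()
aligned, unaligned = align_words(truth, ocr)
```

Here only the misread word is left unaligned. Raising `min_run` makes the automatic step more conservative, which trades yield for fewer false alignments.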

The higher the quality of recognition, the greater the yield of automatically generated ground truth.
Techniques like this are needed to lower the per-page cost of generating test collections, which in turn
permits evaluations to scale up. As a result, this approach places a premium on OCR accuracy. To move
forward in the development of this approach, we needed an OCR technology capable of processing FR
images with a reasonably low rate of error.

In a prior effort that used NIST OCR technology to recognize FR page numbers[1], we observed a field
error rate of 12%. Given our time constraints, adapting and developing our own in-house technology to
recognize the entire page would have been too costly. As a result, we decided to evaluate three
commercially available OCR products. It has been stated that a 1% OCR error rate can only be attained
by commercial OCR products whenever “a printed document is a fixed, typed original or a clean copy, in
a simple paragraph format in a common typing font”[2]. Although this reference is from 1990, it still
seems to be an accurate summary of the state of OCR. The pages of the FR certainly do not conform to
these constraints, so we designed a small experiment to help determine the level of OCR performance that
can be achieved. To accomplish this, three products were evaluated by focusing on their character-level
errors. Other sources of error, such as page decomposition and the processing of non-text items, were
excluded from this evaluation.

The design of the evaluation is presented in Section 2.0; results are reported in Section 3.0, and
conclusions are drawn in Section 4.0.

2.0  DESIGN  OF  THE  EXPERIMENT 


2.1  Image  Scanning  and  Quality  Verification 

Before the FR was scanned, its pages were cut from their bookbinding. This resulted in page sizes of
approximately 20 cm (8") by 28 cm (11"). Image scanning was performed using a Kodak 923¹ scanner to
output a compressed bitonal image in the tagged image format (TIFF™[3]). The scanner did not apply
any special adaptive image enhancement to the grayscale image before converting it to a bitonal TIFF.
Approximately 67,000 FR pages were scanned.

Reference [1] documents the process used to validate the entire collection of FR images. With the
high-speed batch scanning of thousands of pages, a surprisingly large number of diversified errors were
detected. Some errors appeared to be caused by the machinery and others by the operator. Images of
pages were found to be missing, assigned to the wrong file, truncated, corrupted, skewed, scanned at the
wrong density, etc. To ensure that each file contained a valid image, the following image quality checks
were performed:

• Resolution  = 15.75  pixels/mm  (400  pixel/in) 

• Compressed CCITT Group 4 image[4] file size ≥ 30 kilobytes (Kb)

• Width  < 4000  pixels 


• 4200 pixels < Height ≤ 4900 pixels

¹ Specific hardware and software products identified in this paper were used in order to adequately support the
evaluation of the technology described in this document. In no case does such identification imply recommendation
or endorsement by NIST, nor does it imply that the equipment is necessarily the best available for the purpose.
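
The image quality checks listed above can be expressed as a small validation routine. This is a sketch under stated assumptions, not the validation software documented in Reference [1]: the `ScanInfo` record and the `failed_checks` function are hypothetical, the file-size check is assumed to be a lower bound (to catch blank or truncated files), the upper height bound is assumed to be inclusive, and a kilobyte is taken as 1024 bytes.

```python
from dataclasses import dataclass


@dataclass
class ScanInfo:
    """Hypothetical summary of one scanned page file."""
    pixels_per_inch: int
    file_size_bytes: int
    width: int   # pixels
    height: int  # pixels


def failed_checks(info):
    """Return the names of the quality checks that a scanned page fails."""
    failures = []
    if info.pixels_per_inch != 400:
        failures.append("resolution")
    # Assumption: the size check is a minimum of 30 Kb (1 Kb = 1024 bytes).
    if info.file_size_bytes < 30 * 1024:
        failures.append("file size")
    if not info.width < 4000:
        failures.append("width")
    # Assumption: the upper height bound of 4900 pixels is inclusive.
    if not 4200 < info.height <= 4900:
        failures.append("height")
    return failures


# An 8" by 11" page at 400 pixels/in is 3200 by 4400 pixels.
full_page = ScanInfo(pixels_per_inch=400, file_size_bytes=80_000, width=3200, height=4400)
truncated = ScanInfo(pixels_per_inch=400, file_size_bytes=12_000, width=3200, height=2100)
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it easier to separate machine-caused errors (such as truncation) from operator errors (such as a wrong scanning density).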


2.2  Federal  Register  Page  Format 

GPO prints a new issue of the FR each workday of the year. An issue is typically published in a single
book and contains three distinct sections: a prefix, body, and appendix. Within the body section, "detail"
pages elaborate and provide a record of the meeting notices, proposals, and transactions of the United
States government for the day. Detail pages comprise 95% of the total FR page volume; therefore,
recognition performance on this type of page was the focus of this experiment.

Detail pages like the one shown in Figure 1 are printed in mostly 9-point Vermilion font and contain a
page heading that includes a text banner printed above two horizontal lines. The text banner contains
information that identifies the document, the volume, the date, the topic, and a page number. All detail
pages in this experiment contain three columns of information. Each page column may contain text,
graphics, and/or tabular information that elaborate the transactions of the government. Since the primary
focus of this experiment was to evaluate OCR character error rates, we excluded FR images containing
graphics and tabular information.

2.3  Image  Classification  Criteria 

The FR is printed on newspaper-quality recycled paper. The paper is light-weight and relatively
absorbent so that printing ink frequently bleeds through the page. The quality of the typed print also
fluctuates significantly. Patches of lightly printed characters and heavily smudged characters are often
observed on the same page. Poor quality paper and high-speed printing contribute directly to varied
image quality, which in turn directly impacts the rate of OCR errors.

To study the impact of these factors, three categories of image quality were defined for our evaluation:
good, bad, and ugly. These are fairly subjective categorizations for which images were viewed on a
50.8 cm (20”) workstation display and judged based on the following criteria:

• Good: Illustrated in Figure 2, a good image contains a minimal amount of noise, is easily readable,
and has print quality that causes a minimal number of characters to either touch or be broken.

• Bad: Illustrated in Figure 3, a bad image contains a moderate amount of noise, is readable by a
human, and has print quality that causes many characters to either touch or be broken.

• Ugly: Illustrated in Figure 4, an ugly image contains an excessive amount of noise, contains
sections that are illegible by a human, and has print quality that causes many characters to either
touch or be broken.

Since we had to manually generate and verify the ground truth for each image, we limited our experiment
to five representative pages for each class of image quality. A typical FR page contains 1100 to 1200
words totaling more than 6,000 characters. In all, 15 pages were used, providing over 70,000 characters
from which OCR character error rates could be derived to compare the three OCR products.
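
As an illustration of how a character error rate can be derived from ground truth, the sketch below divides the Levenshtein (edit) distance between the truth text and the OCR output by the number of truth characters. This is one common definition and is an assumption here; the scoring procedure actually used in this evaluation may differ, and the function names are hypothetical.

```python
def edit_distance(truth, ocr):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(ocr) + 1))
    for i, t_char in enumerate(truth, start=1):
        current = [i]
        for j, o_char in enumerate(ocr, start=1):
            cost = 0 if t_char == o_char else 1
            current.append(min(previous[j] + 1,          # truth character deleted
                               current[j - 1] + 1,       # extra OCR character inserted
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def character_error_rate(truth, ocr):
    """Edit distance normalized by the number of ground truth characters."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)


# Hypothetical example: one substituted character in a 31-character line.
truth_line = "rate of assessment apply to all"
ocr_line = "rate of assessment apply to ail"
rate = character_error_rate(truth_line, ocr_line)
```

Because the rate is normalized by the truth length, heavy insertion noise can push it above 100%, so rates computed this way are best compared only across products scored against the same ground truth.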
[Scanned page image omitted.]

Figure 1.  Detail page from the Federal Register.
[Scanned page image omitted.]

Figure 2.  "Good" Quality Image
[Scanned page image omitted.]
have  been  found  that  preclude  this,  a 
premating  treatment  interval  of  2 weeks  for 
females  ai^  4 weeks  Cor  males  can  be  used 
(Note  12).  Selection  of  the  length  of  the 
premating  adTntnlWrnrina  parinA  gbnnld  be 
Stated  and  justified  (sea  alto  1.1,  pointing  out 
the  need  fim  research).  Treatment  ^ould 
continue  throughout  mating  to  termination  of 
males  and  at  least  throu^  implantBtloo  for 
females.  This  will  permit  evaluation  of 
fonctional  effsets  on  male  Cnrtility 
cannot  be  detected  by  histologic  examination 
in  repeated  doee  toxicity  studiet  and  effects 
on  mating  behavior  in  botii  sexes.  If  data 
from  other  studiee  show  dura  ate  e&cts  on 
weight  or  histologic  appearance  of 
reproductive  oigans  in  males  or  femnlmt,  ca 
if  the  quality  of  examinations  Is  dubious  or 
if  there  are  xx>  data  from  other  stuifies,  then 
a more  campreheosive  study  should  be 
designed  (N^  12). 

Mating 

A mating  ratio  of  1:1  is  advisable  and 
procedures  should  allow  identificatian  of 
both  pareuts  of  a litter  (Note  14). 

Terminal  Sacrifice 

Females  may  be  sacrificed  at  any  point 
after  midpregnancy. 

Males  may  be  sacrificed  at  any  time  afrer 
mating  but  it  isadvissUe  to  snsure 
succe^ful  induction  of  piqgiuncy  before 
taking  such  an  irrevocable  step  (Note  15). 

ObservaticBB 

During  study: 

• Signs  and  raortalities  at  least  once  daily; 

• B<^  tv^ght  and  body  vreight  chang»  at 
least  twice  weridy  (Note  18); 

• Food  intake  M least  once  weekly  (except 
during  mating}; 

• Record  vaginal  smears  daily,  at  least 
during  the  mating  period,  to  determine 
whether  there  are  ^ects  on  mating  at 
precoital  timevand 

• Observations  that  have  proved  of  value  in 
other  toxicity  studies. 

At  tevtninat  intamlnotton- 

• Necropsy  (macroGCopk  examinatiou]  of 
all  adults; 

• Preserve  oigans  with  macroscopic 
findings  for  possible  histological  equation; 
keep  corresponding  oigans  of  sufficient, 
controls  for  compuristm; 

• Preserve  testes,  eplcfidymides,  ovaries 
and  uteri  from  all  animals  for  possible 
Figure 4. "Ugly" Quality Image

2.4 OCR Products Evaluated


For this experiment, we chose three products that were commercially available. All three products have a
history of extensive participation in the University of Nevada at Las Vegas Information Science Research
Institute’s annual competition that evaluated and assessed recognition accuracy for machine printed
documents [5]. Two of the products execute on a Windows 95/NT™ personal computer (referred to as
PC products A and B) and one product executes on a UNIX™ workstation.

2.5  Scoring  OCR  Results 

Each  of  the  three  OCR  products  classified  character  segments  as: 

• Accepted: An output character classification confidence value equaled or exceeded an OCR
product’s user-definable threshold confidence value for character acceptance. This type of
classification is not highlighted or marked and would not be presented to a reject repair operator
for adjudication.

• Rejected: An output character classification confidence value was below the OCR product’s
user-definable threshold value for character acceptance. This type of classification is
highlighted and presented in context to a reject repair operator for adjudication.

• Unrecognized: An OCR product cannot classify the segmented area with enough confidence to
output an ASCII representation of it. Instead, it outputs a user-definable unrecognizable
character symbol; the unrecognizable character symbol is usually a or a “?”. Typically, the
segment associated with this type of classification is highlighted and presented in context to a
reject repair operator for adjudication.
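
The three categories can be sketched as a small decision rule. This is only an illustration: the `Segment` type, the [0, 1] confidence scale, and the use of `None` for a segment that received no label are assumptions, not details of any evaluated product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One character segment as a hypothetical OCR engine might report it."""
    label: Optional[str]  # recognized character, or None if no label was produced
    confidence: float     # assumed to lie in [0, 1]


def categorize(segment: Segment, threshold: float) -> str:
    """Map a segment to 'accepted', 'rejected', or 'unrecognized'."""
    if segment.label is None:
        return "unrecognized"
    # Acceptance requires the confidence to equal or exceed the threshold.
    if segment.confidence >= threshold:
        return "accepted"
    return "rejected"


def needs_operator_review(segment: Segment, threshold: float) -> bool:
    """Rejected and unrecognized segments go to the reject repair operator."""
    return categorize(segment, threshold) != "accepted"
```

With a threshold of 0.80, a segment labeled with confidence 0.95 is accepted, one with confidence 0.40 is rejected, and a segment with no label is unrecognized; only the last two would be shown to the operator.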

All the products recognized and output OCR character results using the ISO 8859/1 character set. We
observed that the two PC products correctly recognized the “?” and (characters that are used to denote
an unrecognized character). The UNIX product did not recognize the character at all.

This evaluation focused on raw character classification and did not use confidence thresholds. Every
character reported was scored without any rejection. As a result, character classifications were either
scored as correct or as an error.

A modified version of the University of Washington Scoring Package [6] written by Su Chen was used.
The primary enhancement to this software was the addition of word level scoring (not reported here). The
scoring package dynamically aligns n output OCR result line strings with m reference line strings; and
then, for each pair of matching OCR and reference lines, it aligns the characters within the lines and
scores the results. It reports character, word, and line accuracy measurements.
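
The character-level step can be illustrated with a much simpler stand-in. The sketch below is not the UW Scoring Package algorithm: it assumes OCR and reference lines have already been paired, and it counts correct characters as the length of the longest common subsequence (LCS) of each pair. Both function names are hypothetical.

```python
from typing import Iterable, Tuple


def lcs_length(reference: str, ocr: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(ocr) + 1)
    for ref_char in reference:
        curr = [0]
        for j, ocr_char in enumerate(ocr, start=1):
            if ref_char == ocr_char:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def count_correct(pairs: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Sum (# correct, # objects) over already-paired (reference, ocr) lines."""
    correct = 0
    objects = 0
    for reference, ocr in pairs:
        correct += lcs_length(reference, ocr)
        objects += len(reference)
    return correct, objects
```

For the reference word "character" and the OCR output "charaaer", the LCS is "charaer", so 7 of the 9 reference characters count as correct.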

We scored OCR results using truth data sets that were in reading order. As stated earlier, these files had
to be prepared manually. The electronic files generated by GPO contain a tagged text representation from
which the print copy of each FR book is typeset; however, specific page number identifiers and
boundaries are not included. A truth file for each page was manually generated and cross-checked by
viewing the page's image and extracting the corresponding text from the GPO file. The text was then
edited to correspond, line for line, with the image page content. This process was manually intense,
requiring approximately 20 minutes per page to prepare. It took a clerk/typist 50 minutes to type an entire
page from scratch, so starting from the GPO file was less expensive. Due to time constraints, we included
only 15 pages in the evaluation set.

In future METTREC evaluations, we anticipate relying heavily on word-level scores. However, for this
small evaluation, we only compare OCR error rate where:

$$\text{ErrorRate} = 1 - \frac{\#\,\text{correct}}{\#\,\text{objects}}$$

and #correct is the total number of correctly recognized characters and #objects is the total number of
characters to be recognized.
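
As a worked instance of the formula, the sketch below computes the error rate from the two counts. The function name and the count of correct characters used in the note that follows are hypothetical; only the formula comes from the text.

```python
def error_rate(num_correct: int, num_objects: int) -> float:
    """ErrorRate = 1 - (# correct / # objects)."""
    if num_objects <= 0:
        raise ValueError("num_objects must be positive")
    if not 0 <= num_correct <= num_objects:
        raise ValueError("num_correct must lie between 0 and num_objects")
    return 1.0 - num_correct / num_objects
```

For a page of 6130 characters of which, hypothetically, 6069 are recognized correctly, `error_rate(6069, 6130)` is about 0.00995, or roughly 1%.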


3.0  EXPERIMENTAL  RESULTS 


3.1  Page  Decomposition 

Before presenting the character recognition results, a brief discussion of page decomposition is in order.
Pages of the FR are printed with three text columns. At times, graphical and tabular information spans
multiple columns, creating relatively dynamic page layouts. Our goal is to produce ground truth in
"reading" order; therefore, automatic detection of the page layout is critical. A key aspect is accurate
decolumnization of each page.

Of the three OCR products tested, two have automatic page decomposition capabilities: PC
product B and the UNIX product. The one that does not, PC product A, merely reports OCR character
results in a top-to-bottom, left-to-right order across the entire page. Of the two that did decomposition, PC
product B failed to correctly decompose 3 of the 15 pages, whereas the UNIX product failed to
decompose only 1 of the pages correctly. Note that the pages in this evaluation consisted strictly of 3
text columns. Correct decomposition of more elaborate FR pages is a concern with all of the products
tested.

Since all three products had problems decomposing one or more FR pages, we chose to compute
character recognition scores using manually defined zones. In our application of generating ground truth,
we realize that manually zoning each image is too time consuming and not practical.
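
The difference between reading order and a top-to-bottom, left-to-right order across the whole page can be shown with a small sketch. It assumes hypothetical rectangular zones on a page of three equal-width columns, as in this evaluation; the `Zone` type, the coordinates, and both function names are illustrative and do not describe any product's interface.

```python
from typing import List, NamedTuple


class Zone(NamedTuple):
    """A rectangular text zone in pixel coordinates (hypothetical layout)."""
    left: int
    top: int
    right: int
    bottom: int
    text: str


def raster_order(zones: List[Zone]) -> List[Zone]:
    """Top-to-bottom, then left-to-right across the whole page."""
    return sorted(zones, key=lambda z: (z.top, z.left))


def reading_order(zones: List[Zone], page_width: int, columns: int = 3) -> List[Zone]:
    """Column by column, top-to-bottom within each column."""
    column_width = page_width / columns

    def key(zone: Zone):
        center = (zone.left + zone.right) / 2
        # Assign the zone to the column that contains its horizontal center.
        column = min(int(center // column_width), columns - 1)
        return (column, zone.top)

    return sorted(zones, key=key)


# Six made-up zones: two per column on a 330-pixel-wide page.
PAGE = [
    Zone(220, 0, 320, 50, "c3-top"),
    Zone(0, 60, 100, 110, "c1-bottom"),
    Zone(110, 0, 210, 50, "c2-top"),
    Zone(0, 0, 100, 50, "c1-top"),
    Zone(220, 60, 320, 110, "c3-bottom"),
    Zone(110, 60, 210, 110, "c2-bottom"),
]
```

Calling `reading_order(PAGE, page_width=330)` returns each column's zones together, whereas `raster_order(PAGE)` interleaves the three columns row by row, which is why output in that order cannot be compared line for line with truth data in reading order.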

3.2  OCR  Character  Error  Rates 

This section reports the character recognition error rates measured from three OCR products across three
categories of image quality.

Figure 5 plots the character recognition error rates measured on the five FR pages of good quality. In the
good collection, the pages contain 6130, 6739, 6483, 6346, and 6189 characters respectively, totaling
31,887 characters. All scores are within a 1.5% interval, ranging from just over 2% to 0.5%. PC product
B performs best on all but the last page, but the separation in these scores is so small that the differences
are likely to be statistically insignificant. (We have not run statistical tests of significance in this
experiment, but this assertion is supported by the statistical limits reported in Reference [5]. Tests of
statistical significance will be reported in future METTREC evaluations, but they are not currently
conducted in the UW Scoring Package.)

Figure 7. OCR Error Rate on Ugly Quality of Images


Figure 6 plots the character recognition error rates measured on the five FR pages of bad quality. In the
bad collection, the pages contain 6639, 6290, 6588, 7836, and 7892 characters respectively, totaling 35,245
characters. Unlike the results on the good pages, here there is significant separation between the
products. PC product B and the UNIX product are tightly grouped (ranging between 2% and 5%),
whereas PC product A performs consistently much worse (at least 5% worse on every page).


The results are a little more mixed in Figure 7. In the ugly collection, the pages contain 7285, 7123,
7080, 6203, and 7937 characters respectively, totaling 35,628 characters. As can be observed, the
performance has fallen off dramatically with error rates reaching as high as 74%. PC product A actually
performs best on the first page, but last on pages 2 and 3. PC product B tracks the UNIX product within
6% on the first three pages with performance falling off on the last two pages. The UNIX product is
within 3% of PC product A on the first page, and then scores the best on the remaining 4 pages.

An interesting pattern can be observed in the scores plotted in Figure 7. All three products score best in
the ugly set on pages 2 and 3, with PC product B having an error rate below 15% and the UNIX product
below 10%. The lowest error rate measured for any system on the other 3 pages is over 40%. From these
observations, there appear to be two types of pages represented in the ugly collection. Upon closer
inspection of the images, this was confirmed. Ugly pages 2 and 3 contain a significant amount of
"pepper" noise caused by ink bleed-through. Figure 8 shows a subimage containing this type of noise.
Pages 1, 4, and 5 contain a different source of image degradation. In these pages, the printed characters
are smudged due to a problem in the printing process. As can be seen in Figure 9, the characters appear
to have been typed twice (once dark and once light) with a small translational offset. It should be noted
that it is easier for a human to read the text in Figure 8 than the text illustrated in Figure 9. The latter
requires a greater amount of word level context (as opposed to single character context) for a human to
correctly identify a word.


Figure 8. Pepper Noise from Ugly Page 2


Figure 9. Smudged Characters from Ugly Page 5


Based on these scores, it appears that the vendors of PC product B and the UNIX product have reasonably
good techniques for dealing with the presence of pepper noise; however, their error rates are significantly
higher (by about 8%) than those on the good FR pages. In contrast, all the products performed poorly on
the pages with smudged characters. Perhaps this source of image degradation is unique to the FR and/or
its publication process.
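
The size of the evaluation set can be tallied from the per-page character counts reported in this section. The sketch simply sums those counts; the dictionary layout and variable names are illustrative choices.

```python
# Per-page character counts as reported for each quality category.
PAGE_CHARACTER_COUNTS = {
    "good": [6130, 6739, 6483, 6346, 6189],
    "bad": [6639, 6290, 6588, 7836, 7892],
    "ugly": [7285, 7123, 7080, 6203, 7937],
}

totals = {quality: sum(counts) for quality, counts in PAGE_CHARACTER_COUNTS.items()}
grand_total = sum(totals.values())

print(totals)       # {'good': 31887, 'bad': 35245, 'ugly': 35628}
print(grand_total)  # 102760
```

The per-category sums match the totals stated in the text, and the three collections together hold 102,760 characters.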

3.3  Timing  Results 

The UNIX product apparently detects poor image quality and applies additional processing resources to
obtain a better segmentation and classification. Good quality images were optically recognized in less
than 60 seconds (s), averaging 45s. Bad quality images were optically recognized in 60s to 120s,
averaging 90s. Ugly quality images were optically recognized in over 120s, averaging 160s.

For the two PC products, the elapsed time to OCR an entire FR page did not vary with the quality of the
image and averaged between 35s and 40s. We conclude that the PC products are engineered primarily with
speed in mind so that a somewhat linear/homogeneous solution is applied regardless of the quality of the
image.
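
To put these timings in perspective, the sketch below estimates the total recognition time for the 15-page evaluation set from the reported per-page averages. It assumes five pages per quality category, as in this evaluation, and treats the averages as exact; the variable names are illustrative.

```python
PAGES_PER_CATEGORY = 5

# Average seconds per page reported for the UNIX product.
UNIX_AVERAGE_S = {"good": 45, "bad": 90, "ugly": 160}

# Reported range of average seconds per page for the PC products.
PC_AVERAGE_RANGE_S = (35, 40)

unix_total_s = PAGES_PER_CATEGORY * sum(UNIX_AVERAGE_S.values())
pc_total_range_s = tuple(
    seconds * PAGES_PER_CATEGORY * len(UNIX_AVERAGE_S)
    for seconds in PC_AVERAGE_RANGE_S
)

print(unix_total_s)      # 1475
print(pc_total_range_s)  # (525, 600)
```

Under these assumptions the UNIX product needs roughly 25 minutes for the set, against about 9 to 10 minutes for a PC product.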


4.0  CONCLUSIONS 

We have presented the results of a small OCR evaluation in which three different vendor products (two
Windows NT/95-based and one UNIX-based) were tested. The purpose of the evaluation was to
determine the state of commercial OCR technology with respect to processing pages of the Federal
Register (FR). NIST must use this technology in order to produce initial text to image alignments for
generating ground truth in future METTREC evaluations. This semi-automated truthing process will lower
the cost of preparing testing materials and will permit experiments to scale up. Due to time constraints
and the current cost of manually preparing ground truth for documents, fifteen FR pages were evaluated.
Images from five different pages were visually and subjectively selected to represent each of three
categories of progressively worse print and image quality. Though a small number, these pages contained
over 70,000 characters. As a result, a number of interesting conclusions can be drawn.

Working with the products, we conclude that page decomposition is a very fragile technology, even on
well-formed multi-column pages. Users should expect a relatively high error rate on more complex page
layouts. Results suggest that OCR can produce good recognition results (error rates less than 1%) from
high quality document images. On the other hand, current OCR technology produces dismal results (40%
and higher) from document images that contain poor print quality and/or a high amount of image
degradation. We did observe better performance on documents degraded with pepper noise than those
degraded with smudged characters. By measuring execution times, we conclude that PC-based products
are engineered primarily for speed and use a static algorithmic solution regardless of image quality. In
contrast, the UNIX product exhibited the ability to detect low quality images and poor recognition
conditions and alter its solution strategy to compensate. This enables a more adaptive and potentially
more robust solution under difficult conditions. Based on all these factors, we selected the UNIX product
for generating ground truth for METTREC.

When we searched for commercially available OCR, we found only a couple of UNIX-based products on
the market. As companies migrate and develop OCR technology for PCs, their products are being
targeted towards GUI-based, small-office automation applications that process relatively high quality
document images. In the end, this will not serve the needs of corporations and government agencies that
require the processing of low image quality documents in a centralized, high-speed, batch-oriented
environment.